Non-Concatenated FEC Codes for Ultra-High Speed Optical Transport Networks

ABSTRACT

A decoder performs forward error correction based on quasi-cyclic regular column-partition low density parity check codes. A method for designing the parity check matrix reduces the number of short-cycles of the matrix to increase performance. An adaptive quantization post-processing technique further improves performance by eliminating error floors associated with the decoding. A parallel decoder architecture performs iterative decoding using a parallel pipelined architecture.

RELATED APPLICATIONS

This application claims priority under 35 U.S.C. §119(e) to U.S. Provisional Patent Application Ser. No. 61/447,620 entitled “Non-Concatenated FEC Codes for Ultra-High Speed Optical Transport Networks,” filed Feb. 28, 2011 by Damian Morero, et al., the content of which is incorporated by reference herein.

BACKGROUND

1. Field of the Art

The disclosure relates generally to communication systems, and more specifically, to forward error correction codes.

2. Description of the Related Art

Error correction is an important component of applications such as optical transport networks and magnetic recording devices. For example, in the next generation of coherent optical communication systems, powerful forward error correction (FEC) codes are desirable to achieve high net effective coding gain (NECG) (e.g., ≥10 dB at a bit error rate (BER) of 10⁻¹⁵). Given their performance and suitability for parallel processing, large block size low density parity check (LDPC) codes are a promising solution for ultra-high speed optical fiber communication systems. Because a large block size is needed to achieve high NECG, low complexity soft decoding techniques such as the min-sum algorithm (MSA) are often used when aiming at an efficient very large scale integration (VLSI) implementation. The main stumbling block for the application of this coding approach has been the fact that traditional LDPC codes suffer from BER error floors that are undesirably high.

The error floor is a phenomenon encountered in traditional implementations of iterated sparse graph-based error correcting codes like LDPC codes and Turbo Codes (TC). When the bit error ratio (BER) curve is plotted for conventional codes like Reed Solomon codes under algebraic decoding or for convolutional codes under Viterbi decoding, the curve steadily decreases as the Signal to Noise Ratio (SNR) condition becomes better. For LDPC codes and TC, however, there is a point after which the curve does not fall as quickly as before. In other words, there is a region in which performance flattens. This region is called the error floor region.

To reduce these error floors, some decoders concatenate an LDPC code with a hard-decision-based block code. However, this approach increases the overhead and reduces the performance and the spectral efficiency.

SUMMARY

In a first embodiment, a decoder is provided for forward error correction using an LDPC code based on a parity check matrix comprising a plurality of sub-matrices. The decoder comprises a plurality of check node processing units (CNPU) and a plurality of variable node processing units (VNPU). Each check node processing unit performs a check node computation corresponding to a different row of the parity check matrix. Each variable node processing unit determines variable node update computations corresponding to different columns belonging to a same sub-matrix of the parity check matrix. The plurality of check node processing units and the plurality of variable node processing units operate on only one sub-matrix of the parity check matrix at each step of an iterative decoding process. The decoder processes two or more codewords in parallel such that the decoder begins decoding a subsequently received codeword prior to completing decoding for a prior received codeword. The decoder architecture beneficially improves the decoding process by, for example, avoiding the penalty introduced by latency of the different blocks inside the decoder.

In a second embodiment, a method for forward error correction is provided. A decoder receives a stream of low density parity check codewords. The decoder iteratively decodes the low density parity check codewords based on a parity check matrix. For each iteration of the decoding, it is determined if an activation criterion is met. Responsive to the activation criterion being met for an iteration of the decoding, the decoder is configured to adaptively quantize messages processed by the decoder based on a scaling factor. This post-processing method based on adaptive quantization beneficially improves the performance of low density parity check decoding algorithms by contributing to reduction or elimination of the error floor.

In a third embodiment, a method is provided for generating a quasi-cyclic regular column-partition parity check matrix for forward error correction. An initial matrix H is received. A count of cycles of the initial matrix H is determined for each of a plurality of different cycle lengths. A matrix Ĥ is created as a copy of the initial matrix H. The matrix Ĥ is modified based on a modification algorithm. A count of cycles of the modified matrix Ĥ is determined for each of the plurality of different cycle lengths. A lowest of the plurality of different cycle lengths is determined for which the initial matrix H and the modified matrix Ĥ have different counts. The initial matrix H is replaced with the modified matrix Ĥ responsive to the initial matrix H having a higher count for the lowest of the plurality of different cycle lengths for which the initial matrix H and the modified matrix Ĥ have different counts.

BRIEF DESCRIPTION OF THE DRAWINGS

The invention has other advantages and features which will be more readily apparent from the following detailed description of the invention and the appended claims, when taken in conjunction with the accompanying drawings, in which:

Figure (FIG.) 1 is a high level block diagram of an embodiment of a communication system.

FIG. 2 is a high level block diagram of an embodiment of a decoder.

FIG. 3 is an example embodiment of a parity check matrix and Tanner graph for forward error correction.

FIG. 4 is an example embodiment of a cycle of a parity check matrix and Tanner graph for forward error correction.

FIG. 5 is an example embodiment of various notations for a parity check matrix for forward error correction.

FIG. 6 is an example embodiment of a parity check matrix for forward error correction.

FIG. 7A is a first example embodiment of a parity check matrix for forward error correction using a short notation.

FIG. 7B is a second example embodiment of a parity check matrix for forward error correction using a short notation.

FIG. 8 is an example embodiment of a process for reducing the number of short cycles in a parity check matrix for forward error correction.

FIG. 9 is an example embodiment of a check node processing unit for decoding LDPC codes.

FIG. 10 is an example embodiment of a decoder architecture for decoding LDPC codes.

FIG. 11A is an example embodiment of a first step of information flow through a decoder architecture for decoding LDPC codes.

FIG. 11B is an example embodiment of a second step of information flow through a decoder architecture for decoding LDPC codes.

FIG. 11C is an example embodiment of a third step of information flow through a decoder architecture for decoding LDPC codes.

FIG. 11D is an example embodiment of a fourth step of information flow through a decoder architecture for decoding LDPC codes.

FIG. 12A is a first performance graph for a set of example LDPC codes.

FIG. 12B is a second performance graph for a set of example LDPC codes.

FIG. 12C is a third performance graph for a set of example LDPC codes.

FIG. 13A is a first performance graph for a set of example LDPC codes using adaptive quantization decoding.

FIG. 13B is a second performance graph for a set of example LDPC codes using adaptive quantization decoding.

DETAILED DESCRIPTION

Overview

A system operates using non-concatenated forward error correction (FEC) codes suitable for applications such as 100 Gb/s optical transport networks (OTN) or magnetic recording (MR) apparatuses. The system operates using a high-gain, very low error floor, long LDPC-only based code suitable for ultra-high speed fiber optical communication systems. The described system can, in some embodiments, achieve a net effective coding gain (NECG) of 10 dB or better at a bit error rate (BER) of 10⁻¹⁵ with an overhead (OH) of ~20% or better. Relative to prior systems that use concatenated codes, the non-concatenated codes described herein achieve superior performance, lower latency, and lower overhead.

To overcome potential performance issues due to BER floors, a low density parity check (LDPC) code is described. Semi-analytic techniques are combined with a study of dominant error events in order to determine a parity check matrix having high performance characteristics. A post-processing method is also described (e.g., an adaptive quantization (AQ) post-processing method) that can effectively eliminate the BER floors once they are pushed to sufficiently low levels in the parity check matrix design. An implementation of a decoder is also described for use with the LDPC codes and post-processing method.

The LDPC code and the hardware architecture of the decoder are jointly designed in order to (i) minimize (or sufficiently lower) the error floor and (ii) reduce the amount of memory and interconnection complexity. For example, in one embodiment, a (24576, 20482) (i.e., 20% OH) QC-LDPC-only code with a 5-bit MSA decoder is used for 100 Gb/s optical systems. Under certain conditions, the described LDPC-only code can achieve an NECG of, for example, 10.70 dB at BER of 10⁻¹³ with a 13-iteration MSA decoder, while being free of error floors down

System Architecture

FIG. 1 is a block diagram of a communication system 100. The communication system 100 comprises a transmitter 110 for transmitting data to a receiver 120 via a communication channel 130. The transmitter 110, receiver 120, and communication channel 130 may be of various types, depending on the end application for the communication system 100. For example, in one embodiment, the communication system 100 comprises an ultra-high speed (e.g., 100 Gb/s or faster) optical fiber communication system. In alternative embodiments, the communication system 100 may comprise, for example, a microwave, radio frequency (RF), cable, or other type of communication system.

The communication channel 130 may be unreliable or noisy. Thus, the data received by the receiver 120 often contains errors (e.g., bit flips) relative to the transmitted data. The transmitter 110 and receiver 120 therefore utilize an error correction technique that enables the receiver 120 to detect, and in many cases, correct errors in the data received over the channel 130 from the transmitter 110.

The transmitter 110 receives input data 105 for transmission to the receiver 120 via the communication channel 130. The transmitter 110 includes an encoder 115 that encodes the data using forward error-correcting (FEC) codes. In one embodiment, a block coding scheme is used in which each block of binary input data is mapped to an FEC codeword. Generally, the FEC code provides some redundancy in the data by incorporating extra data symbols. For example, in one embodiment, the encoder applies a transform function to an input data block having k symbols to generate an FEC code having n symbols, where n>k. This redundancy allows the receiver 120 to detect a limited number of errors that may occur in the transmitted data and in many cases to correct such errors. More specific details about the FEC codes are provided below.

In addition to the encoder 115, the transmitter 110 may comprise other conventional features of a transmitter 110 which are omitted from FIG. 1 for clarity of description. For example, the transmitter may include components such as a modulator, a serial or parallel/serial converter, a driver or amplifier circuit, a laser source, etc.

The receiver 120 receives the data encoded as FEC codes from the transmitter 110 via the communication channel 130. The receiver 120 includes a decoder 125 that decodes the FEC-coded data to attempt to recover the original data blocks. For example, in one embodiment, the decoder 125 applies a parity check matrix H to a received FEC codeword having n symbols to recover a data block having k symbols where n>k. More specific details on the decoding technique are provided below.

In addition to the decoder 125, the receiver 120 may comprise other conventional features of a receiver 120 which are omitted from FIG. 1 for clarity of description. For example, the receiver 120 may include components such as a demodulator, an analog-digital converter, amplifier circuits, timing recovery circuits, an equalizer, various filters, etc.

Components of the transmitter 110 and the receiver 120 described herein may be implemented, for example, as an integrated circuit (e.g., an Application-Specific Integrated Circuit (ASIC) or a field-programmable gate array (FPGA)), in software (e.g., loading program instructions to a processor from a computer-readable storage medium and executing the instructions by the processor), or by a combination of hardware and software.

FIG. 2 illustrates an example embodiment of a decoder 125. In this embodiment, the decoder 125 iteratively decodes the received codewords using a decoding algorithm such as, for example, the sum-product algorithm (SPA), the min-sum algorithm (MSA), or the scaled min-sum algorithm (SMSA). In one embodiment, the decoder 125 comprises a variable-node processing unit (VNPU) 202 and a check-node processing unit (CNPU) 204. The VNPU 202 and/or CNPU 204 may each comprise a plurality of parallel processing units (e.g., q processing units). This allows for an efficient parallel decoding process as will be described in further detail below. More specific examples of architectures for the decoder 125 are described in FIGS. 9-11.

General LDPC Codes

In one embodiment, the communications system 100 uses low density parity check (LDPC) codes for forward error correction. An LDPC code is a linear block code defined as the null space of a sparse (m×n) parity check matrix H, where n represents the number of bits in the block and m denotes the number of parity checks. The matrix H is considered “sparse” because the number of 1s is small compared with the number of 0s. Using the above definition, the set of LDPC codes C is defined as:

$\begin{matrix}{C = \left\{ c : Hc = 0 \right\}} & (1)\end{matrix}$

where c is an LDPC codeword in the set C. Note that each row of H provides a parity check on the codewords. Particularly, each row indicates that a linear combination of certain bits (specified by the 1s in the row) will add to zero for a valid codeword. Furthermore, an invalid codeword can often be corrected by comparing results of multiple parity checks and determining a most likely location of the error(s).
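As a minimal sketch of the parity check in Eq. (1), the following Python fragment computes the syndrome Hc over GF(2) and tests whether a word is a codeword. The 3×6 matrix `H`, the function names, and the example words are hypothetical and chosen only for illustration; they are not the matrices of the described embodiments.

```python
import numpy as np

def syndrome(H, c):
    """Return H·c reduced modulo 2 (one entry per parity check)."""
    return (np.asarray(H) @ np.asarray(c)) % 2

def is_codeword(H, c):
    """A word is a codeword when every parity check evaluates to zero."""
    return not syndrome(H, c).any()

# Hypothetical small parity check matrix (illustration only).
H = np.array([[1, 1, 0, 1, 0, 0],
              [0, 1, 1, 0, 1, 0],
              [1, 0, 1, 0, 0, 1]])

valid = np.array([1, 1, 0, 0, 1, 1])
corrupted = valid.copy()
corrupted[1] ^= 1  # flip bit 1 to emulate a channel error

print(is_codeword(H, valid))       # all checks satisfied
print(syndrome(H, corrupted))      # checks 0 and 1 fail; both involve bit 1
```

For a single bit error, the syndrome equals the column of H at the error position, which is one way the failing checks point to a likely error location.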

Matrix H can be graphically represented using a Tanner graph 300 as illustrated in FIG. 3 for an example matrix H. The Tanner graph 300 is a bipartite graph composed of two types of nodes: (1) variable or bit nodes v_(i) which represent the columns of H; and (2) the check nodes c_(j) which represent the rows of H. A connection between nodes v_(i) and c_(j) exists in the Tanner graph 300 if and only if H_(j,i)=1. Note that there are no connections between two check nodes or between two bit nodes.

LDPC codes can be classified as “regular” or “irregular” based on characteristics of the matrix H. A matrix H is regular if it is both row-regular and column-regular. Otherwise, the matrix H is irregular. Matrix H is row-regular if ρ_(i)=ρ for all i where ρ_(i) is the number of 1s in the i^(th) row of H. In other words, the matrix H is row-regular if all rows have the same number of 1s. Similarly, matrix H is column-regular if γ_(i)=γ for all i where γ_(i) is the number of 1s in the i^(th) column of H. In other words, the matrix H is column-regular if all columns have the same number of 1s.

For a given variable node v_(i) or check node c_(j), the number of connections to it determines its degree. If all v_(i) nodes have the same degree γ and all c_(j) nodes the same degree ρ, then the LDPC code is said to be a (γ, ρ)-regular LDPC.

A “cycle” in the Tanner graph 300 for a matrix H is a closed sequence (e.g., a loop) of connected nodes. FIG. 4 illustrates an example of a cycle having a length 8. As can be seen, the series of connections forms a closed loop with a total of 8 links. Note that a Tanner graph for a matrix H may have multiple cycles. The “girth” of the Tanner graph for a matrix H is the length of the shortest cycle.

Quasi-Cyclic LDPC Codes

A cyclic matrix or “circulant” is a square matrix in which each row is the cyclic shift of the row above it (i.e., the symbols in the row are right-shifted by one relative to the row immediately above it with the last symbol in the row shifted to the first position), and the first row is the cyclic shift of the last row. Furthermore, each column is the downward cyclic shift of the column on its left (i.e., the symbols in the column are down-shifted by one relative to the column immediately to the left of it, with the last symbol in the column shifted to the first position), and the first column is the cyclic shift of the last column.

A characteristic of a circulant is that the row and column weights w are the same, where the weight w of a row or column represents the number of 1s in the row or column. Note that due to the characteristics of the circulant, the row and column weights also give the number of non-zero diagonals in the matrix. The weights w of the rows and columns of the circulant can also generally be referred to as the weight of the circulant. Note that if w=1, then the circulant is a permutation matrix, referred to as a circulant permutation matrix.

Another characteristic of a circulant is that the circulant can be completely characterized by its first row (or first column). In other words, if the first row (or column) of the matrix is known, the rest of the matrix can be generated by applying appropriate shifts to this vector based on the characteristics of the circulant defined above. Therefore, the first row (or column) is referred to herein as the “generator of the circulant.”

In quasi-cyclic LDPC codes, the parity check matrix H is an array of sparse square circulant matrices of the same size. Observing that the LDPC code is given by the null-space of H, a set of quasi-cyclic LDPC codes can be defined by the null space of an array of sparse square circulant matrices of the same size. Quasi-cyclic codes represent a generalization of cyclic codes whereby a cyclic shift of a codeword by p positions results in another codeword. Therefore, cyclic codes are simply QC codes with p=1. QC-LDPC codes can beneficially perform very close to the Shannon limit and their cyclic properties reduce the implementation complexity, and allow the use of efficient algebraic techniques to compute the code parameters and optimize the performance.
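The generator property described above can be illustrated with a short Python sketch that rebuilds a circulant from its first row by successive right cyclic shifts. The function name `circulant_from_generator` and the example generators are assumptions introduced for illustration.

```python
import numpy as np

def circulant_from_generator(first_row):
    """Build a circulant: row r is the generator right-shifted cyclically by r."""
    g = np.asarray(first_row)
    return np.stack([np.roll(g, r) for r in range(len(g))])

# Weight-1 generator: the result is a circulant permutation matrix.
P = circulant_from_generator([0, 1, 0])

# Weight-2 generator: every row and every column contains two 1s.
Q = circulant_from_generator([1, 1, 0])

print(P)
print(Q.sum(axis=0), Q.sum(axis=1))  # column weights and row weights
```

Because each row is a shift of the generator, the row weights and column weights all equal the weight of the generator, as stated in the text.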

An example of a parity check matrix H for QC-LDPC codes is illustrated in FIG. 5 according to various notations. In this example, matrix H is a 3×4 array of 3×3 circulants having varying weights of 0, 1, and 2. As explained above, the circulants are completely defined by the generator (first row) of each circulant. Therefore, the same matrix H can be represented in a short notation given by H_(g) based on the generator. An even more compact notation for representing the matrix H is shown in matrix H_(i). In this notation, each circulant is represented by a vector defining the non-zero column positions of the generator of each circulant. As can be seen, the compact notation in H_(i) completely defines the matrix H.
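The relationship between the compact notation and the full matrix can be sketched as follows. This is only an illustration under stated assumptions: the column positions are taken to be 0-based, the function name `expand_compact` is hypothetical, and the small 2×2 array of 3×3 circulants is invented for the example (it is not the matrix of FIG. 5).

```python
import numpy as np

def expand_compact(compact, size):
    """Expand an array of generator position lists into the full binary matrix.

    compact[a][b] lists the (assumed 0-based) non-zero column positions of the
    generator (first row) of the circulant at block row a, block column b.
    """
    block_rows = []
    for row in compact:
        blocks = []
        for positions in row:
            g = np.zeros(size, dtype=int)
            for p in positions:
                g[p] = 1
            # Row r of the circulant is the generator right-shifted by r.
            blocks.append(np.stack([np.roll(g, r) for r in range(size)]))
        block_rows.append(np.hstack(blocks))
    return np.vstack(block_rows)

# Hypothetical compact description: circulants of weight 1, 2, 0 and 1.
H_i = [[(0,), (1, 2)],
       [(),   (0,)]]
H = expand_compact(H_i, size=3)
print(H)
```

Storing only the generator positions reduces the description of each size-L circulant from L×L entries to at most w integers, where w is the circulant weight.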

Regular Column Partition (RCP) QC-LDPC Codes

A regular column partition QC-LDPC (RCP-QC-LDPC) code is an LDPC code that meets both the column-regular constraint and the quasi-cyclic constraint described above. Let H be the (m×n) parity check matrix of an LDPC code. Assuming that n=μq with μ and q integers, the matrix H can be partitioned into μ (m×q) sub-matrices:

$\begin{matrix}{H = \left\lbrack H^{(0)} \cdots H^{(r)} \cdots H^{(\mu - 1)} \right\rbrack} & (2)\end{matrix}$

The parity check matrix H has the characteristic that the weights of the rows and columns of each of the sub-matrices H^((r)) do not change with r. Thus, each sub-matrix of H is regular and the matrix H itself is regular.

Furthermore, in one embodiment, a type-p RCP-QC-LDPC code is used. In this type of code, each of the circulant sub-matrices H^((r)) has the same row weight p (i.e., the number of non-zero diagonals). For VLSI implementation, a small value of p is often desirable since the complexity increases (at least) linearly with p. A high value of p reduces the maximum girth of the code, which increases the error floor probability.

FIG. 6 illustrates an example (4, 12)-regular matrix H that is partitioned into μ=6 (4, 2)-regular sub-matrices (H⁽¹⁾ . . . H⁽⁶⁾), where q=4. While the example in FIG. 6 is illustrative of the properties of a general parity check matrix for RCP-QC-LDPC codes, a parity check matrix H may in practice be composed of significantly larger sub-matrices H^((r)) and may have a larger number of sub-matrices (higher value of μ). For example, in one embodiment, a parity check matrix H for RCP-QC-LDPC codes comprises a 2×12 array of circulants each having a size 2048×2048 and a weight of 2. Thus, in this embodiment, the parity check matrix H is configured for type-2 RCP-QC-LDPC codes and has a column weight of 4. In one embodiment, the maximum girth of the parity check matrix H is eight.

FIG. 7A illustrates an example of a transposed version of H_(i) representing the compact notation for matrix H corresponding to a first example set of codes C₁. This first code set, C₁, is designed to avoid cycles of length 4. FIG. 7B illustrates an example of a transposed version of H_(i) representing the compact notation for matrix H corresponding to a second example set of codes C₂. This second code set, C₂, is designed to (i) avoid cycles of length 4 and 6 (i.e., it achieves the maximum girth) and (ii) minimize the number of cycles of length 8. A parity check matrix H having the characteristics described above will result in a code length of 24576 symbols (e.g., bits) and k=n−rank(H)=20,482 symbols where k is the size of a decoded codeword. In one embodiment, the RCP-QC-LDPC codes having the characteristics above are able to achieve a net effective coding gain (NECG) of 10 dB or higher at a bit error rate (BER) of 10⁻¹⁵. Furthermore, the RCP-QC-LDPC codes have an overhead of about 20%.

In one embodiment, a variable-node partition with full check-node connectivity (VPFCC) constraint is also imposed on the parity check matrix H. In this embodiment, each of the sub-matrices H^((r)) has only a single 1 in each of its rows.

Reducing the Number of Short-Cycles of the Parity Check Matrix

In order to choose a specific parity check matrix H and an associated set of RCP-QC-LDPC codes C, a technique may be used to find a quasi-cyclic parity check matrix having a low number of short-cycles. An example embodiment of a process for reducing the number of short-cycles is illustrated in FIG. 8. The process can begin by receiving 802 an initial parity check matrix H meeting RCP and quasi-cyclic constraints described above. A vector c is created 804 representing the number of cycles of H of different lengths in order of increasing cycle length. For example, the number of cycles of H with lengths 4, 6, 8, . . . and so on are computed to create a histogram with each bin corresponding to a different cycle length. This computation can be implemented in several ways. For example, in one embodiment, a matrix A is denoted as the adjacency matrix of the Tanner Graph of H. It can be shown that the (i, j)-entry of A^(l) equals the number of paths of length l from node-i to node-j (Theorem 1). Furthermore, it can be shown that in a graph with girth δ, the nodes i and j are directly opposite each other in a δ-loop if and only if (A^(δ/2))_(i,j) ≥ 2 and (A^(δ/2−2))_(i,j) = 0 (Theorem 2). Then, because the Tanner Graph of H is bipartite, it contains only even-length (δ even) cycles. Therefore, there are δ ordered pairs of opposite nodes in a δ-cycle, i.e., there are δ entries of A^(δ/2) that satisfy the constraints above and represent the same cycle. Note also that each non-ordered pair of the (A^(δ/2))_(i,j) paths connecting i and j creates a different loop, i.e., there are Φ(i, j) = (A^(δ/2))_(i,j)((A^(δ/2))_(i,j) − 1)/2 different loops that contain nodes i and j as opposed nodes. Therefore, the number N of minimum length cycles is:

$\begin{matrix}{N = \frac{1}{\delta}\sum\limits_{i,j}{I\left( {i,j} \right) \cdot \Phi\left( {i,j} \right)}} & (3)\end{matrix}$

where I(i, j) ∈ {0,1} is the indicator function which takes the value 1 if Theorem 2 is verified for that particular entry of the adjacency matrix or 0 elsewhere. In one embodiment, to speed up the computation of N in Eq. (3), the polynomial representation of H over the ring Z[x]/(x^(L)−1) may be used. It is also possible to modify Eq. (3) in order to increase the penalty of the Φ(i, j) interconnected cycles. This may be done, for example, by adding an exponential coefficient as [Φ(i, j)]^(w) with w>1 or in general by replacing Φ(i, j) by ƒ(Φ(i, j)) for some non-decreasing function ƒ(·). Since absorbing sets are usually created by the interconnection of several short length cycles, this variation may help to further reduce the probability of an error floor.
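A minimal sketch of Eq. (3) using dense matrix powers is given below. It assumes small matrices (it does not use the polynomial speed-up over Z[x]/(x^(L)−1) mentioned above), and the function name `count_shortest_cycles` and the parameter `max_length` are hypothetical. The sketch tries even lengths δ = 4, 6, . . . in order; for any δ below the girth the indicator of Theorem 2 is never satisfied, so the first δ giving a non-zero count is the girth.

```python
import numpy as np

def count_shortest_cycles(H, max_length=16):
    """Return (girth, N): the girth of the Tanner graph of H and the number
    N of cycles of that length per Eq. (3); (None, 0) if none is found."""
    H = np.asarray(H, dtype=np.int64)
    m, n = H.shape
    # Adjacency matrix of the bipartite Tanner graph (checks first, then bits).
    A = np.zeros((m + n, m + n), dtype=np.int64)
    A[:m, m:] = H
    A[m:, :m] = H.T
    for delta in range(4, max_length + 1, 2):
        half = np.linalg.matrix_power(A, delta // 2)
        shorter = np.linalg.matrix_power(A, delta // 2 - 2)
        opposite = (half >= 2) & (shorter == 0)   # indicator I(i, j)
        phi = half * (half - 1) // 2              # Phi(i, j)
        total = int(phi[opposite].sum())
        if total > 0:
            return delta, total // delta          # each cycle is counted delta times
    return None, 0

# Two checks sharing two bits form a single 4-cycle.
print(count_shortest_cycles([[1, 1], [1, 1]]))
# A weight-2 circulant of size 3 forms a single 6-cycle.
print(count_shortest_cycles([[1, 1, 0], [0, 1, 1], [1, 0, 1]]))
```

Dense powers of the adjacency matrix grow quickly in cost and magnitude, so this form is practical only for toy matrices; it is meant to make the counting argument concrete rather than to handle matrices of the size used in the described embodiments.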

A copy of H is then created 806 and denoted Ĥ. Ĥ is then modified 808 according to a modification algorithm while maintaining the same quasi-cyclic constraint. For example, in one embodiment, one of the cyclic sub-matrices of Ĥ is chosen based on a pseudo-random algorithm and the position of one of its diagonals is changed to a different location which is also chosen based on a pseudo-random algorithm. This step 808 may be repeated several times before continuing to the following step. This technique results in a random walk over the parameters of a quasi-cyclic parity check matrix. In alternative embodiments, a different technique could be used to modify Ĥ.

A vector ĉ is then created 810 representing the number of cycles of Ĥ of each length in order of increasing cycle length. For example, the number of cycles of Ĥ with lengths 4, 6, 8, . . . and so on are computed to create a histogram with each bin corresponding to a different cycle length. This computation can be performed using a similar algorithm as described above. A vector is computed 812 as d=c−ĉ. If at decision block 814, the first non-zero element in d is positive, then H is replaced 816 with Ĥ. Otherwise, the matrix H is kept 818. Note that the comparison of the number of cycles between c and ĉ occurs in increasing order of cycle length. Thus, for example, the cycles of length 4 are compared first; if they are equal the cycles of length 6 are compared and so on. Optionally, if further optimization is desired, the process may return to step 802 and repeat for any number of iterations (e.g., a fixed number of iterations or until a stopping criterion is met).
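The accept/reject rule of steps 812 to 818 amounts to a lexicographic comparison of the two cycle histograms. The sketch below shows that rule and a generic search loop around it. The names `is_improvement` and `random_walk_search` are hypothetical, and the cycle-counting and modification steps are passed in as caller-supplied functions (`count_cycles`, `perturb`) because their details depend on the embodiment.

```python
def is_improvement(c, c_hat):
    """True when the first non-zero element of d = c - c_hat is positive,
    i.e. the candidate has fewer cycles at the shortest length that differs."""
    for before, after in zip(c, c_hat):
        if before != after:
            return before > after
    return False  # identical histograms: keep the current matrix

def random_walk_search(H, count_cycles, perturb, iterations):
    """Repeat steps 806 to 818: perturb a copy and keep it only if it improves."""
    c = count_cycles(H)
    for _ in range(iterations):
        H_hat = perturb(H)             # must preserve the quasi-cyclic structure
        c_hat = count_cycles(H_hat)
        if is_improvement(c, c_hat):
            H, c = H_hat, c_hat
    return H, c

# Histograms ordered by cycle length 4, 6, 8: fewer 6-cycles wins even though
# the number of 8-cycles grows.
print(is_improvement([0, 5, 9], [0, 3, 20]))
# Introducing a 4-cycle is rejected regardless of the longer cycles.
print(is_improvement([0, 5, 9], [1, 0, 0]))
```

Comparing shorter lengths first reflects the priority stated in the text: a candidate is accepted only if it reduces the count at the shortest cycle length for which the two matrices differ.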

Iterative Decoding Algorithms

As data blocks are received by the decoder 125, the decoder decodes the data blocks and applies the parity check matrix H to recover the transmitted data. In one embodiment, the decoder 125 may apply, for example, a sum-product algorithm (SPA), a min-sum algorithm (MSA), or a scaled min-sum algorithm (SMSA) to decode the received data blocks.

Let b_(i) and x_(i) be the i-th bit of the codeword and the corresponding channel output respectively. The input to the decoder 125 is the prior log-likelihood ratio (LLR) L^(a) _(i) defined by:

$\begin{matrix}{{L_{i}^{a} = {\ln\left( \frac{P_{a}\left( {b_{i} = 0 \mid x_{i}} \right)}{P_{a}\left( {b_{i} = 1 \mid x_{i}} \right)} \right)}},} & (4)\end{matrix}$

where P_(a)(·) denotes the a-priori probability of the bit b_(i). Thus, L^(a) _(i) represents an initial likelihood of the input bit i being a 0 or a 1. Then, an iterative decoding procedure between variable and check nodes is carried out as follows:

$\begin{matrix}{{L_{v_{i}\rightarrow c_{j}}^{e} = {L_{i}^{a} + {\sum\limits_{c_{k} \in {C^{(v_{i})}\setminus c_{j}}}L_{c_{k}\rightarrow v_{i}}^{e}}}},} & (5) \\{{L_{c_{j}\rightarrow v_{i}}^{e} = {\varphi^{- 1}\left\{ {\sum\limits_{v_{k} \in {V^{(c_{j})}\setminus v_{i}}}{\varphi\left\lbrack L_{v_{k}\rightarrow c_{j}}^{e} \right\rbrack}} \right\}}},} & (6)\end{matrix}$

where $C^{(v_{i})} = \{ c_{j} : H_{j,i} \neq 0 \}$, $V^{(c_{j})} = \{ v_{i} : H_{j,i} \neq 0 \}$, $\varphi(x) = \ln\left\lbrack \tanh(x/2) \right\rbrack$, and $\varphi^{- 1}(x) = 2\tanh^{- 1}(e^{x})$. The posterior LLR is computed in each iteration by

$\begin{matrix}{L_{i}^{o} = {L_{i}^{a} + {\sum\limits_{c_{k} \in C^{(v_{i})}}L_{c_{k}\rightarrow v_{i}}^{e}}}} & (7)\end{matrix}$

Hard decisions are derived from (7). The iterative decoding process is carried out until hard decisions satisfy all the parity check equations or when an upper limit on the iteration number is reached.

The decoding algorithm can be understood in view of the Tanner Graph (see e.g., FIG. 3). Here, the algorithm can be represented as the passing of messages between the variable nodes and the check nodes of the Tanner Graph as described in the equations above. In the equations (4)-(7), $L_{v_{i}\rightarrow c_{j}}^{e}$ is the extrinsic information sent by the variable node ‘i’ to the check node ‘j’. It represents an estimation of the probability of the bit ‘i’ being a ‘0’ or ‘1’ given the a priori information L^(a) _(i) and the information coming from all other check nodes connected to the variable node ‘i’ except that coming from the check node ‘j’. $L_{c_{j}\rightarrow v_{i}}^{e}$ is the extrinsic information sent by the check node ‘j’ to the variable node ‘i’. It represents an estimation of the probability of the bit ‘i’ being a ‘0’ or ‘1’ given the information coming from all the other variable nodes connected to the check node ‘j’ except that coming from the variable node ‘i’.

The computations of (5) and (7) are performed by the VNPU 202 of the decoder 125 and the computation of (6) is performed by the CNPU 204 of the decoder 125. Since the CNPU 204 consumes most of the computational requirements of the above-described decoding algorithm, a simplified expression of (6) may be implemented:

$\begin{matrix}{{\hat{L}}_{c_{j}\rightarrow v_{i}}^{e} = {\min\limits_{v_{k} \in {V^{(c_{j})}\setminus v_{i}}}{\left| L_{v_{k}\rightarrow c_{j}}^{e} \right| \cdot {\prod\limits_{v_{k} \in {V^{(c_{j})}\setminus v_{i}}}{{sign}\left( L_{v_{k}\rightarrow c_{j}}^{e} \right)}}}}} & (8)\end{matrix}$

This approach is called the min-sum algorithm (MSA). To reduce the approximation error of (8), another modification can optionally be employed called the scaled min-sum algorithm (SMSA). The check node computation performed by the CNPU 204 in SMSA is given by:

$\begin{matrix}{{\hat{L}}_{c_{j}\rightarrow v_{i}}^{e} = {\alpha \cdot {\min\limits_{v_{k} \in {V^{(c_{j})}\setminus v_{i}}}{\left| L_{v_{k}\rightarrow c_{j}}^{e} \right| \cdot {\prod\limits_{v_{k} \in {V^{(c_{j})}\setminus v_{i}}}{{sign}\left( L_{v_{k}\rightarrow c_{j}}^{e} \right)}}}}}} & (9)\end{matrix}$

with α being a factor smaller than unity (e.g., α≈0.75).
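The message-passing loop of Eqs. (5), (7), and (9) can be sketched in Python as below. This is a floating-point, fully sequential illustration under stated assumptions, not the fixed-point parallel architecture described later: the function name `smsa_decode`, the default of 13 iterations, and the small matrix `H` are chosen only for the example, and every check node is assumed to have degree of at least 2.

```python
import numpy as np

def smsa_decode(H, llr, alpha=0.75, max_iter=13):
    """Scaled min-sum decoding; returns (hard decisions, iterations used).
    A positive LLR favours bit 0, following the definition in Eq. (4)."""
    H = np.asarray(H)
    llr = np.asarray(llr, dtype=float)
    m, n = H.shape
    edges = H != 0
    c2v = np.zeros((m, n))                 # check-to-variable messages
    hard = (llr < 0).astype(int)
    for iteration in range(1, max_iter + 1):
        # Eq. (5): prior plus all incoming check messages except the target's.
        v2c = np.where(edges, llr + c2v.sum(axis=0) - c2v, 0.0)
        # Eq. (9): scaled minimum magnitude and sign product of the others.
        for j in range(m):
            idx = np.flatnonzero(edges[j])
            for i in idx:
                others = v2c[j, idx[idx != i]]
                c2v[j, i] = alpha * np.min(np.abs(others)) * np.prod(np.sign(others))
        # Eq. (7): posterior LLR, then hard decision and parity check test.
        posterior = llr + c2v.sum(axis=0)
        hard = (posterior < 0).astype(int)
        if not ((H @ hard) % 2).any():
            return hard, iteration
    return hard, max_iter

# Hypothetical example: all-zero codeword sent, bit 0 received weakly wrong.
H = np.array([[1, 1, 0, 1, 0, 0],
              [0, 1, 1, 0, 1, 0],
              [1, 0, 1, 0, 0, 1]])
decoded, used = smsa_decode(H, [-1.0, 4.0, 4.0, 4.0, 4.0, 4.0])
print(decoded, used)
```

In this example the two checks connected to bit 0 each return a message of +3, which outweighs the prior of −1, so the posterior of bit 0 becomes positive and all parity checks are satisfied after the first iteration.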

To further reduce the implementation complexity, computation of Eq. (9) can be divided into a series of steps represented by equations (10A) to (10E) which are implemented by the CNPU 204:

$\begin{matrix}{{\hat{L}}_{c_{j}\rightarrow v_{i}}^{e} = \left\{ \begin{matrix}{\alpha \cdot M_{j,i}^{(1)} \cdot S_{j,i} \cdot {{sign}\left( L_{v_{i}\rightarrow c_{j}}^{e} \right)}} & {{{if}\mspace{14mu} v_{i}} \neq v_{j,i}^{(1)}} \\{\alpha \cdot M_{j,i}^{(2)} \cdot S_{j,i} \cdot {{sign}\left( L_{v_{i}\rightarrow c_{j}}^{e} \right)}} & {{{if}\mspace{14mu} v_{i}} = v_{j,i}^{(1)}}\end{matrix} \right.} & \left( {10A} \right) \\{M_{j,i}^{(1)} = {\min\limits_{v_{k} \in V^{(c_{j})}}\left| L_{v_{k}\rightarrow c_{j}}^{e} \right|}} & \left( {10B} \right) \\{M_{j,i}^{(2)} = {\min\limits_{v_{k} \in {V^{(c_{j})}\setminus v_{j,i}^{(1)}}}\left| L_{v_{k}\rightarrow c_{j}}^{e} \right|}} & \left( {10C} \right) \\{v_{j,i}^{(1)} = {\arg\left\{ {\min\limits_{v_{k} \in V^{(c_{j})}}\left| L_{v_{k}\rightarrow c_{j}}^{e} \right|} \right\}}} & \left( {10D} \right) \\{S_{j,i} = {\prod\limits_{v_{k} \in V^{(c_{j})}}{{sign}\left( L_{v_{k}\rightarrow c_{j}}^{e} \right)}}} & \left( {10E} \right)\end{matrix}$
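The decomposition in Eqs. (10A) to (10E) can be checked numerically against the direct form of Eq. (9). The following sketch is illustrative only: the function names are hypothetical, messages are floating point, and a zero message is treated as having a positive sign in both functions.

```python
import numpy as np

def check_node_two_min(msgs, alpha=0.75):
    """Check node outputs for one check via Eqs. (10A)-(10E)."""
    msgs = np.asarray(msgs, dtype=float)
    mags = np.abs(msgs)
    signs = np.where(msgs < 0, -1.0, 1.0)
    first = int(np.argmin(mags))              # (10D) position of the first minimum
    m1 = mags[first]                          # (10B) first minimum
    m2 = np.min(np.delete(mags, first))       # (10C) second minimum
    s = np.prod(signs)                        # (10E) product of all signs
    mins = np.full(len(msgs), m1)
    mins[first] = m2                          # the minimum's own edge gets m2
    # (10A): multiplying by the edge's own sign removes it from the product.
    return alpha * mins * s * signs

def check_node_direct(msgs, alpha=0.75):
    """Check node outputs for one check computed edge by edge per Eq. (9)."""
    msgs = np.asarray(msgs, dtype=float)
    out = np.empty(len(msgs))
    for i in range(len(msgs)):
        others = np.delete(msgs, i)
        sign = np.prod(np.where(others < 0, -1.0, 1.0))
        out[i] = alpha * np.min(np.abs(others)) * sign
    return out

incoming = [-1.0, 4.0, 2.5, -3.0]
print(check_node_two_min(incoming))
print(check_node_direct(incoming))
```

The two-minimum form needs only one pass over the incoming messages to obtain two magnitudes, one index, and one sign product, after which every outgoing message follows from a selection and a sign flip; this is the property that makes it attractive for the hardware implementation.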

FIG. 9 illustrates an embodiment of a CNPU 1004 for processing two codewords at the same time according to Eqs. (10A)-(10E) above. In this architecture, Eq. (10A) is computed by the output computation unit 910, Eqs. (10B), (10C), and (10D) are computed by the minimum computation unit 902, and Eq. (10E) is computed by the sign product computation unit 904. The message memory 1 906 and message memory 2 908 save the results of equations (10B)-(10E) as described below.

The minimum computation unit 902 computes the minimum value (called the first minimum value) of the absolute value of L_(v→c) ^(e) as indicated in Eq. (10B). The minimum computation unit 902 also determines which variable node corresponds to this minimum value as described in Eq. (10D). Furthermore, the minimum computation unit 902 computes the minimum value (called the second minimum value) of the absolute values of L_(v→c) ^(e), but without taking into account the message coming from the variable node which corresponds to the first minimum value, as described in Eq. (10C). In other words, the minimum computation unit 902 determines the two lowest absolute values of the input messages from the set of variable nodes and the variable nodes that these messages came from. The sign product computation unit 904 determines the product of the signs of L_(v→c) ^(e) as indicated in Eq. (10E) above. The outputs of the minimum computation unit 902 and the sign product computation unit 904 are stored to the pipelined message memory 1 906 and message memory 2 908. A sign FIFO unit 912 stores the signs of the input messages L_(v→c) ^(e) to be used later by the output computation unit 910. The output computation unit 910 combines the values stored in the sign FIFO unit 912 and the message memory 2 908 according to Eq. (10A) above and outputs the result L_(c→v) ^(e). Operation of the CNPU 1004 in conjunction with a parallel decoder implementation is described in further detail below.
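The decomposition of Eqs. (10A)-(10E) can be illustrated with a short software sketch. It is a floating-point model under stated assumptions, not the circuit of FIG. 9: the function name `cnpu_two_min`, the list standing in for the sign FIFO unit 912, and the local variables standing in for the message memories are all hypothetical.

```python
import math


def cnpu_two_min(msgs, alpha=0.75):
    """Check node update via Eqs. (10A)-(10E) for a single check node.

    msgs holds the incoming messages L^e_{v_k -> c_j}. The input is scanned
    once, so only two minima, one index, and one sign product are retained.
    (Hypothetical helper; assumes at least two connected variable nodes.)
    """
    m1 = m2 = math.inf   # first and second minimum, Eqs. (10B) and (10C)
    idx1 = -1            # variable node holding the first minimum, Eq. (10D)
    s = 1.0              # product of all signs, Eq. (10E)
    signs = []           # stands in for the sign FIFO
    for k, m in enumerate(msgs):
        a = abs(m)
        if a < m1:
            m2 = m1
            m1, idx1 = a, k
        elif a < m2:
            m2 = a
        sg = math.copysign(1.0, m)
        s *= sg
        signs.append(sg)
    # Eq. (10A): multiplying by the node's own sign removes it from the product.
    return [alpha * (m2 if i == idx1 else m1) * s * signs[i]
            for i in range(len(msgs))]


print(cnpu_two_min([2.0, -0.5, 1.5, -3.0]))
# prints [0.375, -1.125, 0.375, -0.375]
```

For this input the result matches a direct evaluation of Eq. (9), while the stored state per check node is limited to the two minima, the index of the first minimum, and the sign product.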

Parallel Implementation of Iterative Decoding Algorithm

The constraint imposed by RCP allows an efficient partial parallel implementation of the decoding algorithm. An example embodiment of a parallel pipelined decoding architecture is illustrated in FIG. 10 for the example case where q=4 as in the matrix H of FIG. 6. The decoder 125 includes a first-in-first-out (FIFO) memory 1016 that stores the a-priori LLRs, permutation blocks Π 1012 and Π⁻¹ 1014, parallel VNPUs 1002, serial, parallel, or semi-parallel CNPUs 1004, a control unit 1022, and multiplexers 1018, 1020. The permutation blocks 1012, 1014 can be implemented with multiplexers (if the permutation is not constant, i.e., the sub-matrices H^((r)) are not equal) or they can be implemented as wires. The control unit 1022 generates control signals utilized by the other blocks of the decoder 125. In particular, the control unit 1022 controls the permutation blocks 1012, 1014, and turns on and off post-processing algorithms (which are implemented by the VNPUs 1002 or the CNPUs 1004) that will be described in further detail below. The control unit 1022 also controls the computations and memories inside the CNPUs 1004 and controls the select lines of the multiplexers 1018, 1020.

Each iteration of the iterative decoding algorithm is divided into μ steps, with each step corresponding to one of the sub-matrices of H. At the r-th step, only the messages related to the sub-matrix H^((r)) are computed. Thus, for example, at a first step (r=0), the decoder 125 receives LLRs from Eq. (4) corresponding to the first q bits (e.g., q=4) of a codeword (e.g., bits corresponding to v₁, v₂, v₃, v₄ of the first sub-matrix H⁽⁰⁾). The multiplexer 1018 and permutation block 1012 operate to select the appropriate inputs to each of the CNPUs 1004 to perform the computation of Eq. (8), (9), or (10A)-(10E) (depending on the particular implementation used). In one embodiment, the permutation block 1012 comprises a barrel shifter. The CNPUs 1004 perform the check node computation of Eqs. (8), (9), or (10A)-(10E), with each CNPU 1004 corresponding to a different parity check (row of H^((r)) for the sub-matrix r being processed). In this embodiment, eight CNPUs 1004 operate in parallel corresponding to each of the rows (check nodes) of H. In one embodiment, the number of input messages L_(v) _(i) _(→c) _(j) ^(e) and output messages L_(c) _(j) _(→v) _(i) ^(e) that each CNPU 1004 can compute per clock cycle is equal to the number of 1s in the corresponding row of the sub-matrix being processed. If the CNPU 1004 computes only one input message and one output message per clock cycle, it is called a serial CNPU. If it computes more than one (but fewer than the total number of 1s in the corresponding row of H) input and output messages per clock cycle, it is called a semi-parallel CNPU. Furthermore, in one embodiment, each CNPU 1004 can operate on two different received codewords at a time using, for example, the CNPU architecture of FIG. 9 described above. For example, in one embodiment, the minimum computation unit 902 and the sign product computation unit 904 of FIG. 9 can operate on one codeword while the output computation unit 910 operates on a different codeword. The CNPU supports two different codewords because the minimum computation unit 902 and the output computation unit 910 are isolated by the implementation of the two message memories 906 and 908.

Inverse permutation block 1014 (e.g., a barrel shifter) receives the outputs of the CNPUs 1004 and provides appropriate inputs to the VNPUs 1002 for carrying out the computation of Eq. (5) and Eq. (7). In one embodiment, the decoder 125 has q parallel VNPUs 1002 (e.g., q=4) corresponding to the q columns (variable nodes) of each sub-matrix of H. In one embodiment, the complexity is reduced because only q (and not n) parallel VNPUs 1002 are implemented, i.e., it is not necessary to implement one VNPU per variable node. Multiplexer 1020 provides LLR values to FIFO register 1016, which outputs these to the VNPUs 1002 at the appropriate time to compute Eq. (5) and Eq. (7). Feedback paths 1024, 1026 provide intermediate values to the beginning of the pipeline to perform additional iterations of the iterative decoding process.

The decoder architecture of FIG. 10 beneficially allows the decoder 125 to reuse the same q VNPUs 1002 at each step, reducing the associated hardware by a factor of μ. Furthermore, the interconnection complexity is also reduced because the interconnection network is associated with the non-zero entries of H^((r)), which is μ times smaller than that of the original H. The blocks of the CNPU 1004 are simplified since the recursive computation of the check node equation (8) has significantly lower complexity than a full-parallel implementation. Furthermore, the recursive CNPU 1004 stores only two minimum values, which are the outputs of equations (10B) and (10C), in one embodiment. Therefore, it is not necessary to store all L^(e) messages. This reduces the memory requirements of the decoder 125.

In one embodiment, the decoder architecture of FIG. 10 efficiently performs the iterative decoding process by processing multiple codewords. For example, rather than processing all of the iterations of one codeword and then going to the next codeword, the decoder instead processes one iteration of a first codeword, then one iteration of a second codeword, and so on up to an N^(th) codeword. Then, the decoder processes the next iteration of the first codeword, and so on. Note that two codewords can be processed at the same time by different blocks of the decoder (for instance, the minimum computation unit 902 and the output computation unit 910 can process different codewords at the same time). This modification can be combined with early termination (i.e., a variable number of iterations is performed on each codeword depending on the outcome of the parity check). In this embodiment, when the decoding process of one of the N codewords is completed, a new codeword can replace it while the other codewords continue the decoding process. Thus, the decoder need not necessarily wait until all the N codewords are decoded in order to introduce new codewords to the decoder.
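The interleaved processing order with early termination can be illustrated by a small scheduling model. This is only a sketch of the ordering of iterations: the function `interleaved_schedule` is hypothetical, and the per-codeword iteration count is a stand-in for the outcome of the parity check, which a real decoder would evaluate after each iteration.

```python
from collections import deque


def interleaved_schedule(codewords, n_slots):
    """Return the order in which (codeword, iteration) pairs are processed.

    codewords: list of (name, iterations_needed) pairs; iterations_needed
    models the iteration at which all parity checks become satisfied.
    n_slots: number N of codewords held in the decoder at the same time.
    """
    pending = deque(codewords)
    slots = [None] * n_slots

    def load(s):
        # Place the next waiting codeword in slot s, or leave the slot empty.
        if pending:
            name, need = pending.popleft()
            slots[s] = [name, need, 0]
        else:
            slots[s] = None

    for s in range(n_slots):
        load(s)

    trace = []
    while any(slot is not None for slot in slots):
        for s in range(n_slots):
            if slots[s] is None:
                continue
            slots[s][2] += 1                     # one more iteration for this codeword
            name, need, done = slots[s]
            trace.append((name, done))
            if done >= need:                     # early termination
                load(s)                          # a new codeword takes the freed slot
    return trace


print(interleaved_schedule([("A", 2), ("B", 1), ("C", 2)], n_slots=2))
# prints [('A', 1), ('B', 1), ('A', 2), ('C', 1), ('C', 2)]
```

In this trace, codeword C enters the decoder as soon as codeword B terminates, without waiting for codeword A to finish.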

For example, when the multiplexers 1018, 1020 close the decoder loop, there may be two codewords stored in the decoder: e.g., codeword A and codeword B. The output computation unit 910 of the CNPU 1004 reads the information of the codeword A from the message memory 908 (see FIG. 9) and computes the messages L_(c) _(j) _(→v) _(i) ^(e) of codeword A. These messages are passed to the VNPU 1002 through the permutation block 1014. The VNPU 1002 computes the messages L_(v) _(i) _(→c) _(j) ^(e) of codeword A. These messages return to the CNPU 1004 through the multiplexer 1018 and the permutation block 1012. All these blocks (1014, 1002, 1018, 1012) may introduce a latency (for example, due to their pipeline implementation). Because of this latency, the minimum computation unit 902 does not start processing until the new L_(v) _(i) _(→c) _(j) ^(e) arrive. Therefore, if the decoder 125 supported only one codeword, the output computation unit 910 may finish before the minimum computation unit 902 has finished, and the output computation unit 910 would have to wait for the minimum computation unit 902 to finish its process. This waiting time would reduce the computation speed of the decoder 125. In order to avoid this penalty, two (or more) codewords are stored in the decoder loop. As soon as the output computation unit 910 finishes the computation process of one codeword, for example codeword A, it can start with the computation of the other codeword, for example codeword B, which was stored in the message memory 906. This is done by copying the contents of memory 906 to memory 908. Later, when the minimum computation unit 902 finishes its computation process of the codeword A, it stores the results in the message memory 906 and it can immediately start the computation process of codeword B. If the total latency of blocks 1014, 1002, 1018, 1012, and 904 is higher than the number of sub-matrices, more than two codewords may be decoded and stored at the same time in order to avoid the above-described waiting time. This can be done by increasing the number of message memories (in a serial FIFO concatenation) inside the CNPU 1004.

FIGS. 11A-D illustrate the flow of information through the pipelined decoder architecture of FIG. 10, in which the decoder processes two codewords in parallel. The CNPUs 1004 shown in FIGS. 11A-D are divided into three sub-blocks. The first (left) sub-block corresponds to the minimum computation unit 902, the sign product computation unit 904, and part of the FIFO unit 912 shown in FIG. 9. The second (center) sub-block corresponds to the message memory 1 906 shown in FIG. 9. The third (right) sub-block corresponds to the message memory 2 908, the output computation unit 910, and part of the FIFO 912. In FIG. 11A, a first iteration of a first codeword (e.g., q LLR) is passed in μ clock cycles to the CNPUs 1004 and enters the FIFO register 1016. After that, in FIG. 11B, the first iteration of the first codeword moves forward in the internal pipelines of the CNPUs 1004 and a first iteration of a second codeword is passed to the CNPUs 1004. Furthermore, the first iteration of the second codeword enters the FIFO register 1016 and the first iteration of the first codeword moves forward in the FIFO register 1016. After that, in FIG. 11C, the first iteration of the first codeword is passed from the CNPUs 1004 to the VNPUs 1002. The first iteration of the second codeword moves forward in the CNPU 1004 pipelines and in the FIFO register 1016. A second iteration of the first codeword enters the CNPU 1004 and the FIFO register 1016. After μ clock cycles, in FIG. 11D, the first iteration of the second codeword is passed from the CNPUs 1004 to the VNPUs 1002. The second iteration of the first codeword moves forward in the CNPU 1004 pipeline and FIFO register 1016. A second iteration of the second codeword enters the CNPU 1004 and FIFO register 1016. As will be apparent, the process described above can be extended to N codewords for any integer N.

EXAMPLE PERFORMANCE MEASUREMENTS

In one embodiment, performance of the LDPC codewords can be evaluated using a combination of analytical tools and simulation (e.g., using a field-programmable gate array or other device). For example, in one embodiment, simulations in the proximity of the low BER region of interest (e.g., ≳10⁻¹³) could be used to obtain dominant trapping sets. Based on these trapping sets, BER can be estimated by using an importance sampling technique.

Let r_(a) and r_(e) be the number of bits used to represent the prior LLRs and the messages from both check and variable nodes, respectively. In one embodiment, the prior LLRs are quantized (e.g., using r_(a)=5 bits). Furthermore, in one embodiment, the decoder is implemented using r_(e)=5 bits. To obtain the performance measures, an all-zeros codeword can be transmitted using binary phase-shift keying (BPSK) modulation (i.e., bits {0, 1} are mapped into symbols {+1, −1} for transmission). An additive white Gaussian noise (AWGN) channel can also be implemented to model channel noise by using a random number generator.
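The measurement setup described above can be sketched in software as follows. This is an illustrative model only: the function names, the quantization step of 0.5, and the symmetric saturating quantizer are assumptions that the description does not specify, and the LLR expression 2y/σ² is the standard result for BPSK over an AWGN channel with noise variance σ², used here in place of Eq. (4).

```python
import random


def quantize(x, bits=5, step=0.5):
    """Uniform saturating quantizer with a symmetric range (assumed form)."""
    max_level = 2 ** (bits - 1) - 1                    # 15 levels per side for 5 bits
    level = max(-max_level, min(max_level, round(x / step)))
    return level * step


def prior_llrs_all_zero(n, sigma, bits=5, step=0.5, seed=0):
    """Quantized prior LLRs for an all-zeros codeword of n bits.

    Each bit 0 is mapped to the BPSK symbol +1, Gaussian noise with standard
    deviation sigma is added, and the channel LLR 2*y/sigma**2 is quantized
    to r_a = bits bits.
    """
    rng = random.Random(seed)                          # random number generator for the noise
    llrs = []
    for _ in range(n):
        y = 1.0 + rng.gauss(0.0, sigma)                # received sample
        llrs.append(quantize(2.0 * y / sigma ** 2, bits, step))
    return llrs


llrs = prior_llrs_all_zero(n=8, sigma=0.8)
print(llrs)                                            # values lie in [-7.5, 7.5]
```

A negative entry in the returned list corresponds to a channel error on that bit before decoding, because the transmitted codeword is all zeros.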

FIGS. 12A-12C illustrate performance results for an example implementation of the decoder 125. FIG. 12A depicts the BER versus the signal-to-noise ratio (SNR) for code C₁ described above with 13 iterations of the decoding algorithm. The curve shows an error floor at BER=10⁻¹¹ with an NECG of 9.45 dB at BER=10⁻¹⁵. This error floor is caused by the presence of several absorbing sets (AS) created by the combination of cycles of length 6.

FIG. 12B shows the performance of code C₂ described above. No error floor is observed up to 10⁻¹³ and the expected NECG is 11.30 dB. However, from importance sampling analysis, a quantization-sensitive error floor below 10⁻¹³ can be estimated. This error floor is caused by the combination of a (12,8) absorbing set and the quantization of the L^(e) messages in the SMSA.

FIG. 12B shows the estimated error floor for r_(e)=5, 6, and 7 bits with 13 iterations. The a-posteriori LLR evolution of the (12,8) absorbing set is shown in FIG. 12C. (In the notation "(e, d) absorbing set," e is the number of wrong bits and d is the number of unsatisfied check nodes.) Note that the SMSA decoder with r_(e)=5 bits does not resolve the (12,8) absorbing set, independently of the number of iterations. On the other hand, the SMSA decoder takes 17 and 12 iterations to resolve it with r_(e)=6 and 7 bits, respectively.

Min-Sum Algorithm With Adaptive Quantization

A common problem with decoders based on SPA, MSA, or their variations is that error floors tend to arise. These error floors can be challenging to estimate and reduce, particularly at very low levels (e.g., below 10⁻¹³). As shown above, very low error floors (e.g., below 10⁻¹³) may be caused by quantization effects. In order to effectively combat these low error floors, a post-processing technique may be applied.

In one embodiment, the performance limitations described above can be improved using a real-time adaptive quantization scheme. The real-time adaptive quantization scheme combats error floor exacerbation caused by a low precision implementation of the decoder 125. The decoder 125 applies real-time adaptation of the fractional point position in the fixed point representation of the internal MSA messages, keeping constant the total number of bits.

The adaptive quantization algorithm applies a scaling to the log-likelihood ratios (LLRs) and messages in order to increase the range of representation, and therefore reduce the saturation effects. In one embodiment, this scaling step is activated only when a predefined activation condition is met. For example, in one embodiment, the scaling is applied when the number of unsatisfied check nodes does not exceed a threshold value d_(t) (e.g., d_(t)=8 or d_(t)=9 for a (12,8) absorbing set). A check node is unsatisfied if its corresponding parity equation (i.e., a row of the parity check matrix H) is unsatisfied according to the sign value of the a posteriori output of the decoder at that state. This activation condition usually occurs only after some normal iterations without scaling. Note that since the total number of bits is maintained constant, this wider range is obtained at the expense of an increase in the quantization step.
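The activation condition can be sketched as a count of unsatisfied parity checks computed from the signs of the a posteriori LLRs. The function names and the row representation (each row of H given as a list of the column indices holding a 1) are assumptions for illustration, as is the sign convention that a negative LLR corresponds to bit 1, which follows the BPSK mapping of bit 0 to +1 used elsewhere in the description.

```python
def unsatisfied_checks(h_rows, posterior_llrs):
    """Count the check nodes whose parity equation is not satisfied.

    h_rows: one list of column indices per row of H (positions of the 1s).
    posterior_llrs: a posteriori LLR of each variable node.
    """
    count = 0
    for row in h_rows:
        parity = 0
        for col in row:
            bit = 1 if posterior_llrs[col] < 0 else 0   # hard decision from the sign
            parity ^= bit
        count += parity                                  # parity 1 means unsatisfied
    return count


def scaling_active(h_rows, posterior_llrs, d_t=8):
    """Activation condition: unsatisfied checks do not exceed the threshold d_t."""
    return unsatisfied_checks(h_rows, posterior_llrs) <= d_t


h_rows = [[0, 1, 2], [1, 2, 3]]        # toy 2x4 parity check matrix
llrs = [1.0, -2.0, 0.5, 3.0]           # hard decisions: 0, 1, 0, 0
print(unsatisfied_checks(h_rows, llrs))   # prints 2
print(scaling_active(h_rows, llrs, d_t=8))   # prints True
```

A decoder that uses early termination would stop when the count reaches zero, so in practice the scaling would apply only to nonzero counts at or below the threshold.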

The fixed-point modification increases the dynamic range of the decoder messages (by increasing the quantization step). In one embodiment, the quantization change is implemented in the VNPU 1002 after summation, because here the messages achieve their highest value and the saturations have the strongest distortion effect.

The scaling step can be generalized and implemented inside the VNPU 1002 as:

$\begin{matrix}{{L_{v_{i}\rightarrow c_{j}}^{e} = {{\kappa_{1}^{t} \cdot L_{i}^{a}} + {\kappa_{2} \cdot \left( {\sum\limits_{c_{k} \in {c^{(v_{i})}\backslash c_{j}}}L_{c_{k}\rightarrow v_{i}}^{e}} \right)}}},} & (11)\end{matrix}$

where t=1, 2, . . . , denotes the number of the extra iteration used for post-processing. Factors κ₁ and κ₂ are positive gains smaller than unity.

In one embodiment, to simplify the implementation, κ₁=κ₂=κ can be used. Thus, the algorithm reduces to scaling by κ both the output of the variable-node equation (Eq. (5)) and the prior LLR. This is shown as:

$\begin{matrix}{{L_{v_{i}\rightarrow c_{j}}^{e} = {\kappa \cdot \left( {L_{i}^{a} + {\sum\limits_{c_{k} \in {c^{(v_{i})}\backslash c_{j}}}L_{c_{k}\rightarrow v_{i}}^{e}}} \right)}},} & (12) \\\left. L_{i}^{a}\leftarrow{\kappa \cdot {L_{i}^{a}.}} \right. & (13)\end{matrix}$

Note that the prior information is gradually reduced to zero as the adaptive quantization process evolves. After a given number of iterations, the MSA operates without prior information. In one embodiment, κ=½ provides a good tradeoff between performance and implementation complexity.
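The simplified scaling of Eqs. (12) and (13) can be sketched for a single variable node as follows. This is a floating-point illustration with a hypothetical function name; the fixed-point representation and saturation of the VNPU 1002 are not modeled.

```python
def vnpu_scaled_update(prior_llr, check_msgs, kappa=0.5):
    """Variable node update with adaptive-quantization scaling.

    prior_llr: prior LLR L^a_i of variable node v_i.
    check_msgs: incoming messages L^e_{c_k -> v_i}, one per connected check node.
    Returns the outgoing messages of Eq. (12) and the scaled prior of Eq. (13).
    """
    total = prior_llr + sum(check_msgs)
    # Eq. (12): the message to c_j excludes the message received from c_j.
    out_msgs = [kappa * (total - m) for m in check_msgs]
    # Eq. (13): the prior itself is scaled for the next extra iteration.
    new_prior = kappa * prior_llr
    return out_msgs, new_prior


msgs, prior = vnpu_scaled_update(4.0, [1.0, -2.0, 3.0])
print(msgs, prior)
# prints [2.5, 4.0, 1.5] 2.0
```

With κ=½, each extra iteration halves the stored prior (4.0, 2.0, 1.0, and so on), which illustrates how the prior information decays toward zero. In a binary fixed-point representation a factor of ½ corresponds to a one-bit shift, which is one plausible reading of the low implementation complexity mentioned above.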

FIGS. 13A-13B illustrate performance of the decoder using the adaptive quantization algorithm. In FIG. 13A, the a-posteriori LLR evolution of the SMSA decoder for code C₂ is shown with r_(e)=5 bits over the (12,8) absorbing set. Note that the absorbing set can be resolved with 5-6 extra iterations. FIG. 13B shows the estimated BER versus SNR derived with r_(e)=5 bits and the adaptive quantization algorithm. As can be seen, the error floor observed in FIG. 12B is corrected by the adaptive quantization algorithm.

Using the RCP-QC-LDPC codes and the adaptive quantization technique described above, the complexity of the decoding can be substantially reduced (e.g., to about 5 extra iterations under some conditions). Furthermore, the error floors can be drastically lowered, resulting in an expected NECG of 11.30 dB or better at a BER of 10⁻¹⁵ in some embodiments. Furthermore, the described approach beneficially avoids a hard-decision-based block outer code and reduces the block size significantly relative to prior techniques. This reduction of complexity and the concomitant reduction of latency can be an important factor for commercial applications, thereby enabling applications such as 100 Gb/s optical transport networks.

In some embodiments, these codes can achieve an expected coding gain of, for example, 11.30 dB at 10⁻¹⁵, 20% OH, and a block size of 24576 bits. The described code beneficially can minimize the BER floor while simultaneously reducing the memory requirements and the interconnection complexity of the iterative decoder. Under certain conditions, the described codes can achieve NECG of 10.70 dB at a BER of 10⁻¹³ and no error floors.

Although the detailed description contains many specifics, these should not be construed as limiting the scope of the invention but merely as illustrating different examples and aspects of the invention. It should be appreciated that the scope of the invention includes other embodiments not discussed in detail above. For example, the functionality has been described above as implemented primarily in electronic circuitry. This is not required; various functions can be performed by hardware, firmware, software, and/or combinations thereof. Depending on the form of the implementation, the “coupling” between different blocks may also take different forms. Dedicated circuitry can be coupled to each other by hardwiring or by accessing a common register or memory location, for example. Software “coupling” can occur by any number of ways to pass information between software components (or between software and hardware, if that is the case). The term “coupling” is meant to include all of these and is not meant to be limited to a hardwired permanent connection between two components. In addition, there may be intervening elements. For example, when two elements are described as being coupled to each other, this does not imply that the elements are directly coupled to each other nor does it preclude the use of other elements between the two. Various other modifications, changes and variations which will be apparent to those skilled in the art may be made in the arrangement, operation and details of the method and apparatus of the present invention disclosed herein without departing from the spirit and scope of the invention as defined in the appended claims. Therefore, the scope of the invention should be determined by the appended claims and their legal equivalents.

1. A decoder for decoding forward error correcting codewords using a parity check matrix comprising a plurality of sub-matrices, the decoder comprising: a plurality of check node processing units, each check node processing unit performing a check node computation corresponding to a different row of the parity check matrix; and a plurality of variable node processing units, each variable node processing unit determining variable node update computations corresponding to different columns belonging to a same sub-matrix of the parity check matrix; wherein the plurality of check node processing units and the plurality of variable node processing units operate on only one sub-matrix of the parity check matrix at each step of an iterative decoding process; and wherein the decoder processes two or more codewords in parallel such that the decoder begins decoding a subsequently received codeword prior to completing decoding of a prior received codeword.
2. The decoder of claim 1, wherein the decoder is configured to apply an iterative decoding algorithm and wherein the decoder is configured to process a first iteration of the prior received codeword, and process a first iteration of the subsequently received codeword prior to processing a second iteration of the prior received codeword.
3. The decoder of claim 1, wherein each of the check node processing units comprises: a parallel pipelined processing architecture for processing two or more codewords at a time.
4. The decoder of claim 1, wherein each of the check node processing units comprises: a minimum computation unit determining first and second minimum values of input messages received from a plurality of variable nodes and determining a variable node corresponding to the first minimum value; a sign product computation unit determining a product of signs of the input messages received from the plurality of variable nodes; a plurality of pipelined message memories storing the determined first and second minimum values, an identifier of the determined variable node corresponding to the first minimum value, and the product of the signs; a sign FIFO unit storing the signs of the input messages from the plurality of variable nodes; and an output computation unit determining an output message based on the signs of the input messages from the sign FIFO unit and values stored in the plurality of message memories.
5. The decoder of claim 1, wherein the parity check matrix is row-regular and column-regular such that the parity check matrix has a first same number of 1s in each row and a second same number of 1s in each column.
6. The decoder of claim 1, wherein the parity check matrix is quasi-cyclic such that a circular shift of a valid codeword by an integer amount results in another valid codeword.
7. The decoder of claim 6, wherein the parity check matrix comprises an array of circulant sub-matrices.
8. The decoder of claim 1, wherein the parity check matrix comprises a 2×12 array of circulants of size 2048×2048, each circulant having two non-zero diagonals.
9. The decoder of claim 1, wherein the decoder is configured to iteratively decode the low density parity check codewords based on one of: a sum-product algorithm, a min-sum algorithm, and a scaled min-sum algorithm.
10. The decoder of claim 9, wherein the decoder further comprises a control unit for: determining if an activation criteria is met during an iteration of the decoding; and responsive to the activation criteria being met, configuring the decoder to adaptively quantize messages processed by the decoder.
11. The decoder of claim 10, wherein adaptively quantizing the input values to the decoder comprises scaling log-likelihood ratios and the messages by scaling factors to increase a representation range given a fixed number of bits.
12. The decoder of claim 10, wherein determining if the activation criteria is met comprises determining if a number of unsatisfied check nodes is smaller than a predetermined threshold.
13. A method for forward error correction comprising: receiving, by a decoder, a stream of low density parity check codewords; iteratively decoding the low density parity check codewords based on a parity check matrix; for each iteration of the decoding, determining if an activation criteria is met; and responsive to the activation criteria being met for an iteration of the decoding, configuring the decoder to adaptively quantize messages processed by the decoder based on a scaling factor.
14. The method of claim 13, wherein iteratively decoding the low density parity check codewords comprises decoding based on one of: a sum-product algorithm, a min-sum algorithm, and a scaled min-sum algorithm.
15. The method of claim 13, wherein determining if the activation criteria is met comprises determining if a number of unsatisfied parity checks of the parity check matrix is smaller than a predetermined threshold.
16. The method of claim 13, where adaptively quantizing the input to the decoder comprises: determining a log-likelihood ratio of received bits of the low density parity check codewords; receiving a message representing output of a prior decoding iteration; scaling the log-likelihood ratio and the message by a scaling factor to increase a representation range given a fixed number of bits; and iteratively decoding the low density parity check codewords using the scaled log-likelihood ratio and scaled message.
17. A method for generating a quasi-cyclic regular column-partition parity check matrix for forward error correction, the method comprising: receiving an initial matrix H; determining a count of cycles of the initial matrix H for each of a plurality of different cycle lengths; creating a matrix Ĥ as a copy of the initial matrix H; modifying the matrix Ĥ based on a modification algorithm; determining a count of cycles of the modified matrix Ĥ for each of the plurality of different cycle lengths; determining a lowest of the plurality of different cycle lengths for which the initial matrix H and the modified matrix Ĥ have different counts; and replacing the initial matrix H with the modified matrix Ĥ responsive to the initial matrix H having a higher count for the lowest of the plurality of different cycle lengths for which the initial matrix H and the modified matrix Ĥ have different counts.
18. The method of claim 17, wherein modifying the matrix Ĥ based on a modification algorithm comprises: pseudo-randomly selecting a sub-matrix of the matrix Ĥ; pseudo-randomly selecting a diagonal of the selected sub-matrix of the matrix Ĥ; changing a position of the selected diagonal of the selected sub-matrix of the matrix Ĥ to a different pseudo-randomly selected position.
19. The method of claim 17, wherein determining the count of cycles of the initial matrix H for each of the plurality of different cycle lengths comprises: determining an adjacency matrix of a Tanner Graph of matrix H; determining the count based on the adjacency matrix.